By: Martin Jauquet
Final Project - CMSC320 - Spring 2021 - Jose Calderon
Introduction
Baltimore City, the greatest city in America (according to the benches), has a rich history and a unique culture. Throughout its history, Baltimore has attracted many migrants, from Europeans at the turn of the 20th century to more recent Middle Eastern and Latino communities. These groups kept their traditions alive in their own neighborhoods in the City: Little Italy for the Italians, Greektown for the Greeks, and Highlandtown for Latinos, to name a few. Such a strong sense of neighborhood pride and loyalty has developed that many Baltimoreans name their neighborhood when asked where they are from.

In recent years, Baltimore has received considerable national attention, albeit not always in a good way. Crime, vacant houses, and failing schools are often what people think of when they think of Baltimore. Some neighborhoods have it worse than others, but why? Is there one factor that leads to higher crime and lower graduation rates? Or is it a combination of everything that makes it hard for some of these neighborhoods to maintain a decent standard of living?
In this tutorial, we will look at data on income, high school completion rates, vacant buildings, and crime for the neighborhoods in the city. We will do some analysis to determine whether any of these are correlated. Lastly, we will create an interactive map to make the data easier to visualize.
conda install geopandas
# Imports
import folium
import geopandas
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter, StrMethodFormatter
First, we will be getting some data from Vital Signs Open Data Portal, an open data portal for Baltimore. They have tons of data broken up into various geographic regions. For this tutorial, we will be exploring data organized by Community Statistical Area (CSA), which is a group of neighborhoods with similar characteristics. We will be getting the following data sets: High School Completion Rate, Part 1 Crime Rate per 1,000 Residents, Percentage of Residential Properties that are Vacant and Abandoned, and Median Household Income.
On each site, find the download dropdown and download the spreadsheet to your local folder.
We will then open each CSV, read it into a pandas dataframe, remove columns that we don't need, and rename the remaining columns for readability.
# Load High school Completion Rate Data
city_education_data = pd.read_csv("High_School_Completion_Rate_CSA_City.csv")
city_education_data.drop(labels={'FID', 'SHAPE_Area', 'SHAPE_Length'}, axis='columns', inplace=True)
city_education_data.columns = ["CSA", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017"]
city_education_data.head()
| CSA | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Allendale/Irvington/S. Hilton | 80.34934 | 83.94 | 78.13953 | 82.68156 | 86.294416 | 77.647059 | 75.510204 | 75.141243 |
| 1 | Beechfield/Ten Hills/West Hills | 85.86957 | 80.34 | 82.05128 | 89.25620 | 89.423077 | 86.792453 | 86.111111 | 68.000000 |
| 2 | Belair-Edison | 82.85714 | 79.91 | 77.39464 | 82.18623 | 81.568627 | 80.000000 | 84.234234 | 79.500000 |
| 3 | Brooklyn/Curtis Bay/Hawkins Point | 78.46154 | 75.38 | 70.50360 | 74.78261 | 78.000000 | 69.411765 | 74.747475 | 78.448276 |
| 4 | Canton | 75.00000 | 66.67 | 100.00000 | 80.00000 | 80.000000 | 75.000000 | 80.000000 | 50.000000 |
# Load Vacant Residential Building Rate Data
city_building_data = pd.read_csv("Vacant_Residential_Buildings.csv")
city_building_data.drop(labels={'OBJECTID', 'Shape__Area', 'Shape__Length'}, axis='columns', inplace=True)
city_building_data.columns = ["CSA", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020"]
city_building_data.head()
| CSA | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allendale/Irvington/S. Hilton | 4.31 | 4.69 | 5.095427 | 5.244253 | 5.743425 | 6.011809 | 5.141343 | 5.695201 | 5.994024 | 6.678261 | 6.469565 |
| 1 | Beechfield/Ten Hills/West Hills | 0.28 | 0.41 | 0.750208 | 0.416782 | 0.609081 | 0.885936 | 0.775838 | 0.970067 | 1.025499 | 0.775623 | 0.554017 |
| 2 | Belair-Edison | 1.45 | 1.67 | 2.002543 | 2.541700 | 2.731893 | 3.049555 | 2.811755 | 3.749603 | 3.638386 | 3.432931 | 3.337572 |
| 3 | Brooklyn/Curtis Bay/Hawkins Point | 3.74 | 4.16 | 4.981203 | 5.451128 | 5.930807 | 6.166157 | 6.276446 | 7.050529 | 7.826087 | 7.956939 | 8.073953 |
| 4 | Canton | 0.92 | 0.67 | 0.770186 | 0.571571 | 0.471113 | 0.396727 | 0.272480 | 0.421000 | 0.470530 | 0.470763 | 0.396432 |
# Load Crime Rate Data
city_crime_data = pd.read_csv("Part_1_Crime_Rate_per_1000_Residents.csv")
city_crime_data.drop(labels={'OBJECTID', 'Shape__Area', 'Shape__Length'}, axis='columns', inplace=True)
city_crime_data.columns = ["CSA", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020"]
city_crime_data.head()
| CSA | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allendale/Irvington/S. Hilton | 40.574706 | 46.186101 | 45.446137 | 47.296047 | 55.1273 | 59.320466 | 49.577604 | 61.108713 | 62.403650 | 60.862058 | 47.481038 |
| 1 | Beechfield/Ten Hills/West Hills | 33.594260 | 36.121983 | 36.611220 | 36.040444 | 46.5590 | 37.345075 | 39.546641 | 44.439008 | 45.662100 | 38.568167 | 30.251142 |
| 2 | Belair-Edison | 50.298576 | 57.073955 | 52.652733 | 57.361047 | 56.6146 | 52.537896 | 52.767570 | 61.897106 | 53.226918 | 49.896647 | 38.068443 |
| 3 | Brooklyn/Curtis Bay/Hawkins Point | 81.654146 | 79.056379 | 62.135786 | 61.293267 | 54.9042 | 61.644317 | 80.390367 | 98.504529 | 85.445482 | 62.837885 | 52.657446 |
| 4 | Canton | 60.987654 | 64.814815 | 57.901235 | 56.419753 | 46.5432 | 51.234568 | 45.555556 | 43.950617 | 58.641975 | 52.222222 | 36.543210 |
# Load Median Household Income Data
city_income_data = pd.read_csv("Median_Household_Income_CSA_City.csv")
city_income_data.drop(labels={'OBJECTID', 'Shape__Area', 'Shape__Length'}, axis='columns', inplace=True)
city_income_data.columns = ["CSA", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"]
city_income_data.head()
| CSA | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allendale/Irvington/S. Hilton | 33563.12 | 33504.324121 | 33177.658915 | 38129.073308 | 35958.253351 | 36701.906742 | 37302.171053 | 39495.628472 | 38535.562176 | 43019.75792 |
| 1 | Beechfield/Ten Hills/West Hills | 50780.92 | 50439.739513 | 50135.121622 | 49807.861765 | 52622.688525 | 51537.582075 | 53565.079698 | 57572.502747 | 58055.306613 | 55017.77971 |
| 2 | Belair-Edison | 42920.83 | 45149.372510 | 46743.281847 | 43903.901338 | 38905.924198 | 38173.968254 | 40482.359649 | 39624.482085 | 42633.619512 | 46703.93468 |
| 3 | Brooklyn/Curtis Bay/Hawkins Point | 32888.50 | 33644.052189 | 33526.364650 | 34419.965251 | 35861.896552 | 36679.053435 | 38603.930233 | 40275.275330 | 39936.512500 | 39162.13858 |
| 4 | Canton | 74685.14 | 82129.569175 | 84978.141631 | 90862.712924 | 91735.652736 | 95362.400309 | 103281.832192 | 111891.251825 | 116911.088235 | 128460.48210 |
Since not all the tables have data for each year, we will only look at the years 2010-2017.
# Remove columns we can't compare across all tables
city_building_data.drop(columns=['2018', '2019', '2020'], inplace=True)
city_crime_data.drop(columns=['2018', '2019', '2020'], inplace=True)
city_income_data.drop(columns=['2018', '2019'], inplace=True)
Now that we have collected our data, let's tidy it up. We'll create a new column called "Year" and another column for the measured variable of each table. This is called melting. In the end, each table will have one row per (CSA, Year) observation of one measured variable. This makes it easier to do analysis later on.
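To see what melting does before we apply it to the real tables, here is a toy example on a made-up two-column frame (the numbers below are illustrative, not from the Vital Signs data):

```python
import pandas as pd

# Toy wide-format table: one row per CSA, one column per year
wide = pd.DataFrame({
    "CSA": ["Canton", "Belair-Edison"],
    "2010": [75.0, 82.9],
    "2011": [66.7, 79.9],
})

# Melt into long format: one row per (CSA, Year) pair
long = pd.melt(wide, id_vars=["CSA"], var_name="Year", value_name="HS_Compl_Rate")
print(long.shape)  # 2 CSAs x 2 years -> (4, 3)
```

The year columns collapse into a single "Year" column, and the cell values move into "HS_Compl_Rate".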
edu_melt = pd.melt(city_education_data, id_vars=["CSA"], var_name="Year", value_name="HS_Compl_Rate")
building_melt = pd.melt(city_building_data, id_vars=["CSA"], var_name="Year", value_name="Vacant_Rate")
crime_melt = pd.melt(city_crime_data, id_vars=["CSA"], var_name="Year", value_name="Crime_Rate")
income_melt = pd.melt(city_income_data, id_vars=["CSA"], var_name="Year", value_name="Income")
# Display one for example
edu_melt
| CSA | Year | HS_Compl_Rate | |
|---|---|---|---|
| 0 | Allendale/Irvington/S. Hilton | 2010 | 80.349340 |
| 1 | Beechfield/Ten Hills/West Hills | 2010 | 85.869570 |
| 2 | Belair-Edison | 2010 | 82.857140 |
| 3 | Brooklyn/Curtis Bay/Hawkins Point | 2010 | 78.461540 |
| 4 | Canton | 2010 | 75.000000 |
| ... | ... | ... | ... |
| 435 | Southwest Baltimore | 2017 | 75.151515 |
| 436 | The Waverlies | 2017 | 75.438596 |
| 437 | Upton/Druid Heights | 2017 | 77.142857 |
| 438 | Washington Village/Pigtown | 2017 | 74.285714 |
| 439 | Westport/Mount Winans/Lakeland | 2017 | 77.631579 |
440 rows × 3 columns
Now that the data is in an organized form, we will merge it all into a single table. By default, pandas' merge performs an inner join: it keeps only the rows whose key values appear in both dataframes.
For example, the first merge takes each ("CSA", "Year") pair in the income_melt table and looks for a matching pair in the edu_melt table. If one exists, the matching value is added as a new column to the result (here, alongside income_melt's columns).
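As a tiny illustration of how this key matching works (using made-up rows, not the real tables), note that a row missing from either side drops out of the result:

```python
import pandas as pd

left = pd.DataFrame({"CSA": ["Canton", "Belair-Edison"],
                     "Year": ["2010", "2010"],
                     "Income": [74685.14, 42920.83]})
right = pd.DataFrame({"CSA": ["Canton"],
                      "Year": ["2010"],
                      "HS_Compl_Rate": [75.0]})

# Default merge is an inner join: only keys present in BOTH frames survive,
# so the Belair-Edison row disappears from the result
merged = left.merge(right, on=["CSA", "Year"])
print(merged)
```

In our case every CSA appears in all four melted tables for 2010-2017, so no rows are lost.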
# Use a couple merges for the 4 tables
city_var_relation = income_melt.merge(edu_melt, on=["CSA", "Year"])
city_var_relation = city_var_relation.merge(crime_melt, on=["CSA", "Year"])
city_var_relation = city_var_relation.merge(building_melt, on=["CSA", "Year"])
# Round data to 2 decimal points
city_var_relation = city_var_relation.round(2)
city_var_relation
| CSA | Year | Income | HS_Compl_Rate | Crime_Rate | Vacant_Rate | |
|---|---|---|---|---|---|---|
| 0 | Allendale/Irvington/S. Hilton | 2010 | 33563.12 | 80.35 | 40.57 | 4.31 |
| 1 | Beechfield/Ten Hills/West Hills | 2010 | 50780.92 | 85.87 | 33.59 | 0.28 |
| 2 | Belair-Edison | 2010 | 42920.83 | 82.86 | 50.30 | 1.45 |
| 3 | Brooklyn/Curtis Bay/Hawkins Point | 2010 | 32888.50 | 78.46 | 81.65 | 3.74 |
| 4 | Canton | 2010 | 74685.14 | 75.00 | 60.99 | 0.92 |
| ... | ... | ... | ... | ... | ... | ... |
| 435 | Southwest Baltimore | 2017 | 25427.84 | 75.15 | 96.28 | 29.75 |
| 436 | The Waverlies | 2017 | 39098.02 | 75.44 | 79.58 | 4.73 |
| 437 | Upton/Druid Heights | 2017 | 20467.70 | 77.14 | 84.22 | 30.02 |
| 438 | Washington Village/Pigtown | 2017 | 38851.69 | 74.29 | 131.93 | 6.54 |
| 439 | Westport/Mount Winans/Lakeland | 2017 | 36645.24 | 77.63 | 95.80 | 6.95 |
440 rows × 6 columns
At last, we have all of our data in one table. Now that it is in a clean and readable format, let's visualize it and see if we find any trends.
First, let's look at some boxplots for each data set. Boxplots display the distribution of a column using the minimum, maximum, median, first quartile, and third quartile. This will give us a general idea of what the distribution looks like for each variable. We will also be able to spot any outliers that we may want to remove later on.
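As a quick refresher on where those box edges and outlier points come from, here is the standard calculation on a toy sample (matplotlib's boxplot uses the same 1.5 × IQR fence by default):

```python
import numpy as np

# A toy sample with one obvious outlier
sample = np.array([10, 20, 30, 40, 50, 60, 70, 80, 200])

# The quartiles define the box; the median is the line inside it
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1
# Points beyond 1.5 * IQR from the box are drawn as individual outliers
upper_fence = q3 + 1.5 * iqr
outliers = sample[sample > upper_fence]
print(q1, median, q3, outliers)  # 30.0 50.0 70.0 [200]
```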
# Boxplot for median income
income_dist_plot = city_var_relation.boxplot(column="Income", by="Year")
income_dist_plot.yaxis.set_major_formatter(StrMethodFormatter('${x:,}'))
income_dist_plot.set_xlabel("Year")
income_dist_plot.set_ylabel("Median Income")
income_dist_plot.set_title("Distribution of Median Household Income per Year")
plt.suptitle('') # removes automatic title
plt.show()
There is a slight skew toward higher incomes, and more outliers appear in recent years.
# Boxplot for HS completion
hscompl_dist_plot = city_var_relation.boxplot(column="HS_Compl_Rate", by="Year")
hscompl_dist_plot.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
hscompl_dist_plot.set_xlabel("Year")
hscompl_dist_plot.set_ylabel("HS Completion Rate")
hscompl_dist_plot.set_title("Distribution of High School Completion Rate per Year")
plt.suptitle('') # removes automatic title
plt.show()
This distribution is closer to normal, yet there are still a good number of outliers on both ends.
# Boxplot for crime rate
crime_dist_plot = city_var_relation.boxplot(column="Crime_Rate", by="Year")
crime_dist_plot.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
crime_dist_plot.set_xlabel("Year")
crime_dist_plot.set_ylabel("Crime Rate per 1,000 Residents")
crime_dist_plot.set_title("Distribution of Crime Rate per Year")
plt.suptitle('') # removes automatic title
plt.show()
There are some extreme outliers here that will certainly impact the analysis. We need to be careful with how we handle them later on.
# Boxplot for vacant building
building_dist_plot = city_var_relation.boxplot(column="Vacant_Rate", by="Year")
building_dist_plot.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
building_dist_plot.set_xlabel("Year")
building_dist_plot.set_ylabel("Vacant Residential Building Rate")
building_dist_plot.set_title("Distribution of Vacant Residential Building Rate per Year")
plt.suptitle('') # removes automatic title
plt.show()
There are a decent number of outliers here, which skews the data, as you can see from the extended whisker. It will be interesting to see how much of an impact those points have on our linear models.
In the next section, we will try to determine if there are any relations between different variables. We will do this through scatter plots and some linear regression. We will be using sklearn for most of the plots.
A little bit about linear regression before we get into the plots. Linear regression (you may know it as a line of best fit) creates the model that best represents the data. A linear relationship does not necessarily mean there is causation, just that there is a correlation (the two things are related). If there is a positive correlation, the line is increasing; if there is a negative correlation, the line is decreasing. The strength of a linear relationship is measured by the correlation coefficient r, which runs from -1 to 1; values near either extreme indicate a strong relationship, regardless of how steep the slope is.
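The distinction between slope and strength matters for reading the plots below, where the slopes are tiny because income is measured in dollars. A quick synthetic check (toy arrays, not our data) shows that two perfectly linear relationships can have wildly different slopes but identical correlation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_steep = 100 * x      # slope 100
y_shallow = 0.01 * x   # slope 0.01

# Pearson r measures how tightly points hug the line, not how steep it is
r_steep = np.corrcoef(x, y_steep)[0, 1]
r_shallow = np.corrcoef(x, y_shallow)[0, 1]
print(r_steep, r_shallow)  # both 1.0: perfectly linear either way
```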
# To be used in plotting linear regression line
# Extract columns and reshape into nD array
income = np.array(city_var_relation['Income']).reshape(-1, 1)
hs_compl_rate = np.array(city_var_relation['HS_Compl_Rate']).reshape(-1, 1)
crime_rate = np.array(city_var_relation['Crime_Rate']).reshape(-1, 1)
vacant_rate = np.array(city_var_relation['Vacant_Rate']).reshape(-1, 1)
First, we will be exploring how income correlates with High School Completion Rate, Crime Rate, and Vacant Buildings Rate. Income is often a huge factor in the quality of life, so we will be seeing how income levels relate to the other variables.
# Scatter plot
income_compl_plot = city_var_relation.plot(y="HS_Compl_Rate", x="Income", kind="scatter")
income_compl_plot.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
income_compl_plot.xaxis.set_major_formatter(StrMethodFormatter('${x:,}'))
income_compl_plot.set_xlabel("Median Income")
income_compl_plot.set_ylabel("HS Completion Rate")
income_compl_plot.set_title("HS Completion Rate vs Median Income")
# Create and plot linear regression line
reg = linear_model.LinearRegression().fit(income, hs_compl_rate)
plt.plot(income, reg.intercept_ + reg.coef_ * income, '-', color="red")
print("Slope:", reg.coef_, "; Intercept:", reg.intercept_)
plt.suptitle('') # removes automatic title
plt.show()
Slope: [[8.1816741e-05]] ; Intercept: [75.46670219]
For this plot, there is a slight positive correlation, which means that income and HS completion are somewhat related.
# Scatter plot
income_compl_plot = city_var_relation.plot(y="Crime_Rate", x="Income", kind="scatter")
income_compl_plot.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
income_compl_plot.xaxis.set_major_formatter(StrMethodFormatter('${x:,}'))
income_compl_plot.set_xlabel("Median Income")
income_compl_plot.set_ylabel("Crime Rate")
income_compl_plot.set_title("Crime Rate vs Median Income")
# Create and plot linear regression line
lm = linear_model.LinearRegression()
reg = lm.fit(income, crime_rate)
plt.plot(income, reg.intercept_ + reg.coef_ * income, '-', color="red")
print("Slope:", reg.coef_, "; Intercept:", reg.intercept_)
plt.suptitle('') # removes automatic title
plt.show()
Slope: [[-0.00043202]] ; Intercept: [87.46904753]
Here we have a slight negative correlation. This makes sense, since wealthier communities tend to have much less crime than poorer ones. There are some outliers with a significant crime rate.
For the following plot, we will try to adjust our regression to fit an exponential function. Since the data drops fairly sharply and then levels out, it looks as though an exponential model could fit.
# Exponential possibly
# Scatter plot
income_compl_plot = city_var_relation.plot(y="Vacant_Rate", x="Income", kind="scatter")
income_compl_plot.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
income_compl_plot.xaxis.set_major_formatter(StrMethodFormatter('${x:,}'))
income_compl_plot.set_xlabel("Median Income")
income_compl_plot.set_ylabel("Vacant Res. Building Rate")
income_compl_plot.set_title("Vacant Res. Building Rate vs Median Income")
# Fit an exponential model by regressing log(vacant rate) on income
x1 = city_var_relation['Income'].values
y1 = city_var_relation['Vacant_Rate'].values
# Replace zeros with 1 so the log is defined (log(1) = 0)
log_vacant = np.log(np.where(y1 == 0, 1, y1))
line = np.polyfit(x1, log_vacant, 1)
# Plot the fitted curve over sorted incomes so the line draws left to right
x_sorted = np.sort(x1)
plt.plot(x_sorted, np.exp(line[1]) * np.exp(x_sorted * line[0]), '-', color="red")
plt.suptitle('') # removes automatic title
plt.show()
Well, it doesn't look like the exponential model quite worked. Fitting a line to the log-transformed values weights the many near-zero vacancy rates heavily, so the curve hugs the low end of the data and does not provide much helpful insight.
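One way to make the log-transform fit less fragile is to weight the fit, since taking logs exaggerates the noise in near-zero values. A sketch on synthetic data (not the Vital Signs numbers) using `np.polyfit`'s `w` parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(20000, 120000, 50)
# Synthetic exponential decay with additive noise, clipped so the log is defined
y = 30 * np.exp(-4e-5 * x) + rng.normal(0, 0.1, x.size)
y = np.clip(y, 0.01, None)

# Unweighted log-fit: tiny y values dominate the loss in log space
b_plain, a_plain = np.polyfit(x, np.log(y), 1)
# Weighting by y down-weights the noisy near-zero tail
b_wt, a_wt = np.polyfit(x, np.log(y), 1, w=y)
print(b_plain, b_wt)
```

Both fits recover a decaying curve here, but on real data with many near-zero points the weighted version is typically less distorted by the tail.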
In the last two plots, we will see how the vacancy rate and crime rate relate to the high school completion rate. Graduating high school opens the door to many opportunities and is a strong indicator of the wellbeing of a community.
# Scatter plot
income_compl_plot = city_var_relation.plot(y="HS_Compl_Rate", x="Vacant_Rate", kind="scatter")
income_compl_plot.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
income_compl_plot.xaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
income_compl_plot.set_xlabel("Vacant Res. Building Rate")
income_compl_plot.set_ylabel("HS Completion Rate")
income_compl_plot.set_title("HS Completion Rate vs Vacant Res. Building Rate")
# Create and plot linear regression line
lm = linear_model.LinearRegression()
reg = lm.fit(vacant_rate, hs_compl_rate)
plt.plot(vacant_rate, reg.intercept_ + reg.coef_ * vacant_rate, '-', color="red")
print("Slope:", reg.coef_, "; Intercept:", reg.intercept_)
plt.suptitle('') # removes automatic title
plt.show()
Slope: [[-0.17639053]] ; Intercept: [80.48567947]
We have a slight negative correlation here. The data is so cluttered toward the left that it is hard to tell how strong a correlation the vacant-house rate has with the HS completion rate.
# Scatter plot
income_compl_plot = city_var_relation.plot(y="HS_Compl_Rate", x="Crime_Rate", kind="scatter")
income_compl_plot.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
income_compl_plot.xaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
income_compl_plot.set_xlabel("Crime Rate")
income_compl_plot.set_ylabel("HS Completion Rate")
income_compl_plot.set_title("HS Completion Rate vs Crime Rate")
# Create and plot linear regression line
lm = linear_model.LinearRegression()
reg = lm.fit(crime_rate, hs_compl_rate)
plt.plot(crime_rate, reg.intercept_ + reg.coef_ * crime_rate, '-', color="red")
print("Slope:", reg.coef_, "; Intercept:", reg.intercept_)
plt.suptitle('') # removes automatic title
plt.show()
Slope: [[-0.02145728]] ; Intercept: [80.59834383]
Lastly, we looked at Crime Rate vs. High School Completion Rate. This plot does not tell us very much; there is too much noise in the data points to get a clear linear model.
Now, let's look at the average value across the 8 years for each CSA. This will hopefully help us gain some better insights and remove extra noise from our models.
# Lets find average for each neighborhood
neighborhood_means = city_var_relation.groupby('CSA').mean()
x = neighborhood_means['Income'].values
y = neighborhood_means['HS_Compl_Rate'].values
# Create scatter plot
fig, ax = plt.subplots()
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
ax.scatter(x, y)
# Format plot axes
ax.xaxis.set_major_formatter(StrMethodFormatter('${x:,}'))
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
X = np.reshape(x, (x.size, 1))
Y = np.array(y)
# Create and plot linear regression line
lm = linear_model.LinearRegression()
reg = lm.fit(X, Y)
plt.plot(X, reg.intercept_ + reg.coef_ * X, '-', color="red")
print("Slope:", reg.coef_, "; Intercept:", reg.intercept_)
# Add titles to plot
ax.set_title('Average HS Completion Rate vs Average Median Income by Neighborhood')
ax.set_xlabel('Average Median Income')
ax.set_ylabel('Average HS Completion Rate')
plt.show()
Slope: [8.79026065e-05] ; Intercept: 75.19366161236813
We can see there is a positive correlation between median income and high school completion. This is reasonable, since families with more income can likely afford better schools and extra academic help compared to those without disposable income.
# Lets find average for each neighborhood
neighborhood_means = city_var_relation.groupby('CSA').mean()
neighborhood_means = neighborhood_means[neighborhood_means['Crime_Rate'] < 150]
x = neighborhood_means['Income'].values
y = neighborhood_means['Crime_Rate'].values
# Create scatter plot
fig, ax = plt.subplots()
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
ax.scatter(x, y)
# Format plot axes
ax.xaxis.set_major_formatter(StrMethodFormatter('${x:,}'))
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
X = np.reshape(x, (x.size, 1))
Y = np.array(y)
# Create and plot linear regression line
lm = linear_model.LinearRegression()
reg = lm.fit(X, Y)
plt.plot(X, reg.intercept_ + reg.coef_ * X, '-', color="red")
print("Slope:", reg.coef_, "; Intercept:", reg.intercept_)
# Add titles to plot
ax.set_title('Average Crime Rate vs Average Median Income by Neighborhood ')
ax.set_xlabel('Average Median Income')
ax.set_ylabel('Average Crime Rate')
plt.show()
Slope: [-0.00041754] ; Intercept: 82.4377653287491
Again, we have a negative correlation between these two variables. Higher income often brings more security, and less crime happens in those kinds of neighborhoods.
In the next plot, we will again attempt to fit an exponential model.
# Lets find average for each neighborhood
neighborhood_means = city_var_relation.groupby('CSA').mean()
x = neighborhood_means['Income'].values
y = neighborhood_means['Vacant_Rate'].values
# Create scatter plot
fig, ax = plt.subplots()
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
ax.scatter(x, y)
# Format plot axes
ax.xaxis.set_major_formatter(StrMethodFormatter('${x:,}'))
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
"""
X = np.reshape(x, (x.size, 1))
Y = np.array(y)
# Create and plot linear regression line
lm = linear_model.LinearRegression()
reg = lm.fit(X, Y)
plt.plot(X, reg.intercept_ + reg.coef_ * X, '-', color="red")
"""
x.sort()
x = x[::-1]
line = np.polyfit(x, np.log(y), 1)
plt.plot(x, np.exp(line[1]) * np.exp(x * line[0]), '-', color="red")
# Add titles to plot
ax.set_title('Average Vacant Res. Building Rate vs Average Median Income by Neighborhood ')
ax.set_xlabel('Average Median Income')
ax.set_ylabel('Average Vacant Res. Building Rate')
plt.show()
Unfortunately, the model again did not turn out as expected. This is likely because the log-transform is very sensitive to the near-zero vacancy rates.
# Lets find average for each neighborhood
# Remove outlier
neighborhood_means = city_var_relation.groupby('CSA').mean()
neighborhood_means = neighborhood_means[neighborhood_means['Crime_Rate'] < 150]
x = neighborhood_means['Crime_Rate'].values
y = neighborhood_means['HS_Compl_Rate'].values
# Create scatter plot
fig, ax = plt.subplots()
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
ax.scatter(x, y)
# Format plot axes
ax.xaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
X = np.reshape(x, (x.size, 1))
Y = np.array(y)
# Create and plot linear regression line
lm = linear_model.LinearRegression()
reg = lm.fit(X, Y)
plt.plot(X, reg.intercept_ + reg.coef_ * X, '-', color="red")
print("Slope:", reg.coef_, "; Intercept:", reg.intercept_)
# Add titles to plot
ax.set_title('Average HS Completion Rate vs Average Crime Rate by Neighborhood ')
ax.set_xlabel('Average Crime Rate')
ax.set_ylabel('Average HS Completion Rate')
plt.show()
Slope: [-0.1139682] ; Intercept: 86.26233339505902
Here we have a strong negative correlation between these two variables. Crime is certainly a barrier to a student's ability to learn: the more crime there is, the less likely students are to feel safe and comfortable enough to learn effectively.
# Lets find average for each neighborhood
neighborhood_means = city_var_relation.groupby('CSA').mean()
x = neighborhood_means['Vacant_Rate'].values
y = neighborhood_means['HS_Compl_Rate'].values
# Create scatter plot
fig, ax = plt.subplots()
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
ax.scatter(x, y)
# Format plot axes
ax.xaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:}%'))
X = np.reshape(x, (x.size, 1))
Y = np.array(y)
# Create and plot linear regression line
lm = linear_model.LinearRegression()
reg = lm.fit(X, Y)
plt.plot(X, reg.intercept_ + reg.coef_ * X, '-', color="red")
print("Slope:", reg.coef_, "; Intercept:", reg.intercept_)
# Add titles to plot
ax.set_title('Average HS Completion Rate vs Average Vacant Res. Building Rate by Neighborhood ')
ax.set_xlabel('Average Vacant Res. Building Rate')
ax.set_ylabel('Average HS Completion Rate')
plt.show()
Slope: [-0.18109997] ; Intercept: 80.52167749429275
Lastly, there is a stronger negative correlation between building vacancy and high school completion. Higher vacancy tends to mean more poverty which comes with a variety of barriers for young people to graduate.
Mapping can be a great way to visualize data. Here we will map the averages for each variable in a Heat Map.
There are many different ways to map data. For example, we could have mapped by Zip Code, Census Tract, or Census Block. If you are interested, the Census has descriptions for each geography.
Through Folium, we will be mapping each CSA shape on a map of Baltimore City. This will allow us to look at different layers and see which CSAs have things in common.
First, we will get the shapes for each CSA from the same site as before. We will be downloading the shapefile. Make sure you unzip the entire directory into your local folder. We will also need to install geopandas to be able to convert the shapefile into a format that can be mapped.
# Get shapes for each Baltimore City Neighborhood
city_shapes = geopandas.read_file("Community_Statistical_Areas__CSAs___Reference_Boundaries.shp")
city_shapes.to_file('City_Shape.geojson', driver='GeoJSON')
# Remove columns we don't need and rename the ones we keep
city_shapes.drop(labels={"FID", "Link", "Tracts", "Neigh"}, axis= "columns", inplace=True)
city_shapes.columns = ["CSA", "geometry"]
city_shapes.head()
| CSA | geometry | |
|---|---|---|
| 0 | Allendale/Irvington/S. Hilton | POLYGON ((-8533446.900 4761284.000, -8533447.0... |
| 1 | Beechfield/Ten Hills/West Hills | POLYGON ((-8537625.100 4765025.100, -8537609.4... |
| 2 | Belair-Edison | POLYGON ((-8523467.900 4768528.400, -8523451.2... |
| 3 | Brooklyn/Curtis Bay/Hawkins Point | MULTIPOLYGON (((-8525811.500 4752203.100, -852... |
| 4 | Canton | POLYGON ((-8523889.000 4762493.800, -8523886.9... |
# Combine income table and shape table
geo_data_table = city_shapes.merge(neighborhood_means, on="CSA")
geo_data_table = geo_data_table.round(2)
geo_data_table.head()
| CSA | geometry | Income | HS_Compl_Rate | Crime_Rate | Vacant_Rate | |
|---|---|---|---|---|---|---|
| 0 | Allendale/Irvington/S. Hilton | POLYGON ((-8533446.900 4761284.000, -8533447.0... | 35979.02 | 79.96 | 50.58 | 5.24 |
| 1 | Beechfield/Ten Hills/West Hills | POLYGON ((-8537625.100 4765025.100, -8537609.4... | 52057.69 | 83.48 | 38.78 | 0.64 |
| 2 | Belair-Edison | POLYGON ((-8523467.900 4768528.400, -8523451.2... | 41988.01 | 80.96 | 55.15 | 2.50 |
| 3 | Brooklyn/Curtis Bay/Hawkins Point | MULTIPOLYGON (((-8525811.500 4752203.100, -852... | 35737.38 | 74.97 | 72.45 | 5.47 |
| 4 | Canton | POLYGON ((-8523889.000 4762493.800, -8523886.9... | 91865.84 | 75.83 | 53.42 | 0.56 |
# Create map centered on Baltimore City
# (renamed from `map` to avoid shadowing the Python builtin)
city_map = folium.Map(location=[39.29, -76.61], zoom_start=11)
# Create layer depicting median income levels
folium.Choropleth(
    geo_data=geo_data_table,
    name="Average Median Income",
    data=geo_data_table,
    columns=['CSA', 'Income'],
    key_on="feature.properties.CSA",
    fill_color="YlGn",
    fill_opacity=0.8,
    line_opacity=1,
).add_to(city_map)
folium.Choropleth(
    geo_data=geo_data_table,
    name="Average HS Comp. Rate",
    data=geo_data_table,
    columns=['CSA', 'HS_Compl_Rate'],
    key_on="feature.properties.CSA",
    fill_color="BuGn",
    fill_opacity=0.8,
    line_opacity=1,
).add_to(city_map)
folium.Choropleth(
    geo_data=geo_data_table,
    name="Average Crime Rate",
    data=geo_data_table,
    columns=['CSA', 'Crime_Rate'],
    key_on="feature.properties.CSA",
    fill_color="YlOrRd",
    fill_opacity=0.8,
    line_opacity=1,
).add_to(city_map)
folium.Choropleth(
    geo_data=geo_data_table,
    name="Average Vacant Res. Rate",
    data=geo_data_table,
    columns=['CSA', 'Vacant_Rate'],
    key_on="feature.properties.CSA",
    fill_color="PuRd",
    fill_opacity=0.8,
    line_opacity=1,
).add_to(city_map)
folium.LayerControl().add_to(city_map)
city_map
Determining the causes of crime, vacant buildings, and HS completion rates is much more complicated than looking at some charts. The map helped us see where these problems are, but there are many socio-political factors that shape each neighborhood. While we were able to gain some good insight into the relationships between variables, each neighborhood is still just as valuable as any other. It does not matter what problems are going on where, as long as we are working to solve them.
I love Baltimore. It is one of the greatest cities in the world. You have the Ravens, the Orioles, Old Bay, and Natty Boh. No matter what neighborhood you are in, everyone there can share the same love and pride for their city. Sure, there are some problems we need to solve, but I don't know if I could live anywhere else.
Thank you for reading this analysis of the different neighborhoods of Baltimore. I hope you were able to learn something new. If you are interested in learning more about Baltimore and its neighborhoods, check out this site.